Throughout this notebook, we will follow the same approach. We begin with Data Loading, where we import and prepare our necessary datasets. We then move to Data Manipulation, where we process and transform our data into the appropriate format for analysis. Finally, in the Visualization section, we create plots to effectively communicate our findings.
First we load the needed packages:
# Package Loading
# Spatial data handling
library(sf) # Simple Features for spatial vector data
library(spData) # Spatial datasets
library(terra) # Raster data handling
library(exactextractr) # Precise spatial averaging
library(gdistance) # Compute minimum distances on raster surfaces
# Additional utilities
library(units) # Unit handling and conversion
library(lwgeom) # Lightweight geometry operations
library(gridExtra) # Arranging multiple plots
library(readxl) # Excel file reading
# Core data manipulation and visualization
library(tidyverse) # Data manipulation and visualization suite
library(ggplot2) # Advanced plotting (included in tidyverse)
The analysis uses two main data sources:
SPEI (Standardized Precipitation Evapotranspiration Index) data:
US Geographic Data:
To analyze the SPEI (Standardized Precipitation Evapotranspiration Index) trends across US regions over the past 50 years, we need the NetCDF file (.nc) containing SPEI raster data and the US geographic data from spData package. The .nc file with data about the SPEI index has been downloaded from this website.
# Import SPEI index
spei_index <- rast("data/ex1/spei01.nc")
# Import US regions
us_states <- spData::us_states
We first retrieve all dates from the SPEI index, then select only dates between 1965-2015 (our 50-year analysis period, limited by data availability).
# Get dates from SPEI dataset
dates <- time(spei_index)
# Filter only for years between 1965 and 2015
valid_dates <- which(year(dates) >= 1965 & year(dates) <= 2015)
# Extract the subset of SPEI data for our time period of interest
spei_subset <- spei_index[[valid_dates]]
# Remove from 'dates' the unnecessary dates
dates <- dates[valid_dates]
We implement and compare two methodological approaches. The zonal
statistics step computes the average SPEI for each state, creating a
wide-format panel where each row is a state and each column represents a
specific date (month and year). We then transform this into a
long-format panel, adding state and region identifiers along with
separated year and month columns. This restructuring enables us to
efficiently compute yearly means, first at the state level and then
aggregated to the regional level. However, this aggregation creates a
mean-of-means approach. Below we present a second approach
in which we calculate the mean for each region directly without the
route via the states. The separation of temporal components (year and
month) facilitates these temporal aggregations while maintaining the
geographic hierarchy (states nested within regions) for spatial
analysis.
This approach:
# Zonal statistics to compute the mean (each row is a country and each column is a date)
zonal_stats <- exact_extract(
spei_subset,
us_states,
fun = "mean"
)
## | | | 0% | |= | 2% | |=== | 4% | |==== | 6% | |====== | 8% | |======= | 10% | |========= | 12% | |========== | 14% | |=========== | 16% | |============= | 18% | |============== | 20% | |================ | 22% | |================= | 24% | |=================== | 27% | |==================== | 29% | |===================== | 31% | |======================= | 33% | |======================== | 35% | |========================== | 37% | |=========================== | 39% | |============================= | 41% | |============================== | 43% | |=============================== | 45% | |================================= | 47% | |================================== | 49% | |==================================== | 51% | |===================================== | 53% | |======================================= | 55% | |======================================== | 57% | |========================================= | 59% | |=========================================== | 61% | |============================================ | 63% | |============================================== | 65% | |=============================================== | 67% | |================================================= | 69% | |================================================== | 71% | |=================================================== | 73% | |===================================================== | 76% | |====================================================== | 78% | |======================================================== | 80% | |========================================================= | 82% | |=========================================================== | 84% | |============================================================ | 86% | |============================================================= | 88% | |=============================================================== | 90% | |================================================================ | 92% | |================================================================== | 94% | |=================================================================== | 96% | |===================================================================== | 98% | |======================================================================| 100%
# Convert the wide format panel to long format panel and add state/region identifiers
# This is necessary to compute the yearly mean per country and then the yearly mean per region
spei_long <- zonal_stats %>%
as.data.frame() %>%
# Convert from wide to long format
pivot_longer(
cols = everything(),
names_to = "time_index",
values_to = "spei"
) %>%
# Add state and region identifiers, plus temporal information
mutate(
state = rep(us_states$NAME, each = length(dates)),
region = rep(us_states$REGION, each = length(dates)),
date = rep(dates, times = nrow(us_states)),
year = year(date),
month = month(date)
) %>%
select(state, region, year, month, spei)
# Compute the yearly mean for each state
spei_state_annual <- spei_long %>%
group_by(state, region, year) %>%
summarise(
mean_spei_state = mean(spei, na.rm = TRUE),
.groups = "drop"
)
# Compute the yearly mean for each region
spei_regional <- spei_state_annual %>%
group_by(region, year) %>%
summarise(
mean_spei_region = mean(mean_spei_state, na.rm = TRUE),
.groups = "drop"
)
After these manipulations, we can create a plot that shows the temporal trends of SPEI index for each region (colored lines) and the overall smoothed trend with confidence interval (blue line).
# Plotting
ggplot(spei_regional, aes(x = year, y = mean_spei_region, color = region)) +
geom_line() +
geom_smooth(
data = spei_regional, aes(x = year, y = mean_spei_region),
method = "loess", color = "blue", se = TRUE
) +
scale_color_manual(values = c("red", "green", "turquoise", "purple")) +
scale_x_continuous(
breaks = seq(1970, 2010, by = 10),
expand = c(0.02, 0.02)
) +
scale_y_continuous(
limits = c(-1, 1),
breaks = seq(-1, 1, by = 1)
) +
labs(
title = "SPEI Index by US Region (1965-2015) - mean-of-means approach",
x = "Year",
y = "SPEI Index",
color = "Region"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
)
Now we will use the direct mean approach by calculating means immediately on the region level across years. Key advantages:
# Zonal statistics directly by region (grouping states by region first)
regions_sf <- us_states %>%
group_by(REGION) %>%
summarise(geometry = st_union(geometry))
# Calculate SPEI means directly for regions
zonal_stats_regional <- exact_extract(
spei_subset,
regions_sf,
fun = "mean"
)
## | | | 0% | |================== | 25% | |=================================== | 50% | |==================================================== | 75% | |======================================================================| 100%
# Convert to long format with regional identifiers
spei_regional_direct <- zonal_stats_regional %>%
as.data.frame() %>%
mutate(region = regions_sf$REGION) %>%
pivot_longer(
cols = -region,
names_to = "time_index",
values_to = "spei"
) %>%
mutate(
date = rep(dates, times = nrow(regions_sf)),
year = year(date)
) %>%
group_by(region, year) %>%
summarise(
mean_spei_region = mean(spei, na.rm = TRUE),
.groups = "drop"
)
Next we plot the results and though the differences to our previous approach are minimal, the direct mean approach is more efficient and should be more what we are looking for:
# Plotting
ggplot(spei_regional_direct, aes(x = year, y = mean_spei_region, color = region)) +
geom_line() +
geom_smooth(
data = spei_regional_direct, aes(x = year, y = mean_spei_region),
method = "loess", color = "blue", se = TRUE
) +
scale_color_manual(values = c("red", "green", "turquoise", "purple")) +
scale_x_continuous(
breaks = seq(1970, 2010, by = 10),
expand = c(0.02, 0.02)
) +
scale_y_continuous(
limits = c(-1, 1),
breaks = seq(-1, 1, by = 1)
) +
labs(
title = "SPEI Index by US Region (1965-2015)",
x = "Year",
y = "SPEI Index",
color = "Region"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
)
The plots show:
While both approaches yield similar results due to the linear nature of averaging, the direct regional calculation is methodologically preferred for spatial climate data analysis. In comparison the first graph to be replicated looked as follows:
To calculate the distances between the 10 top populated cities in Spain, and represent the roads within the continental Spain, we used data from the Natural Earth website. Precisely, we loaded: - ne_10m_populated_places shapefile for the most populated cities - ne_10m_roads for the streets
# Load populated cities and roads
places <- st_read("data/ex2/ne_10m_populated_places")
## Reading layer `ne_10m_populated_places' from data source
## `/Users/noemilucchi/Desktop/CodeTerm2/Geospatial/homeworks/bse_geospatial/hw3/data/ex2/ne_10m_populated_places'
## using driver `ESRI Shapefile'
## Simple feature collection with 7342 features and 137 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -179.59 ymin: -90 xmax: 179.3833 ymax: 82.48332
## Geodetic CRS: WGS 84
roads <- st_read("data/ex2/ne_10m_roads")
## Reading layer `ne_10m_roads' from data source
## `/Users/noemilucchi/Desktop/CodeTerm2/Geospatial/homeworks/bse_geospatial/hw3/data/ex2/ne_10m_roads'
## using driver `ESRI Shapefile'
## Simple feature collection with 56600 features and 31 fields
## Geometry type: MULTILINESTRING
## Dimension: XY
## Bounding box: xmin: -166.5325 ymin: -55.11212 xmax: 178.4191 ymax: 71.17768
## Geodetic CRS: WGS 84
From the R package spData, we use the world dataset to extract the geometry of Spain, filtering only for continental Spain. Then, from the dataset containing the most populated cities, we extract only the cities in Spain, sort them in decreasing order of population, and retain only the top 10. Finally, from these top 10 cities, we extract Madrid and Vigo, which will be used in the plot.
# Filter for continental Spain only
spain <- world %>%
filter(name_long=='Spain') %>%
st_cast("POLYGON") %>%
slice_max(st_area(.))
# Get top 10 most populated Spanish cities
spain_top_10_cities <- places %>%
filter(ADM0NAME == "Spain") %>%
arrange(desc(POP_MAX)) %>%
slice_head(n = 10)
# Filter Madrid and Vigo for the plot
madrid_vigo <- spain_top_10_cities %>%
filter(NAME %in% c("Madrid", "Vigo"))
First, we transform the CRS from EPSG:4326 to EPSG:3035, as the former uses latitude and longitude, which would lead to incorrect distance calculations in kilometers. Next, we intersect the road network with Spain’s boundaries to restrict the analysis to roads that are actually within Spain. We then create an empty raster over Spain, set a high resolution, and define the CRS as EPSG:3035 for the same reason mentioned earlier. After that, we generate the friction surface by assigning a value of 1 to cells that contain roads and 100 to cells that do not. This ensures that areas without roads have higher friction, influencing the computation of the shortest path between cities. Subsequently, we compute the transition matrix and, finally, calculate and print the distance matrix, which includes all pairwise distances between the top 10 most populated Spanish cities.
# Transform to planar EPSG:3035 for better distance calculation in km
spain_planar <- st_transform(spain, 3035)
roads_planar <- st_transform(roads, 3035)
cities_planar <- st_transform(spain_top_10_cities, 3035)
# Intersect roads with Spain's boundaries
spain_roads <- st_intersection(roads_planar, spain_planar)
# Create and prepare a raster on the top of Spain
raster <- rast(ext(spain_planar), resolution = 5000, crs = "EPSG:3035")
roads_vect <- vect(spain_roads)
road_raster <- rasterize(roads_vect, raster, field = 1)
# Create friction surface: we assign 100 to cells without streets and 1 with cells
# with streets. In this way we increase the friction for areas without streets, and so
# streets will be chosen when computing the distances between places.
road_values <- values(road_raster)
road_values[is.na(road_values)] <- 100
road_values[!is.na(road_values)] <- 1
values(road_raster) <- road_values
# Create transition matrix for distance calculation
tr_matrix <- transition(raster(road_raster),
transitionFunction = mean,
directions = 8)
# Calculate distances between all city pairs
n_cities <- nrow(cities_planar)
dist_matrix <- matrix(0, n_cities, n_cities)
rownames(dist_matrix) <- cities_planar$NAME
colnames(dist_matrix) <- cities_planar$NAME
for(i in 1:n_cities) {
for(j in 1:n_cities) {
if(i < j) {
path <- shortestPath(tr_matrix,
origin = as_Spatial(cities_planar[i,]),
goal = as_Spatial(cities_planar[j,]),
output = "SpatialLines")
path_sf <- st_as_sf(path)
st_crs(path_sf) <- st_crs(cities_planar)
# Calculate distance in kilometers
dist <- as.numeric(st_length(path_sf)) / 1000
dist_matrix[i,j] <- dist
dist_matrix[j,i] <- dist}}}
# Print the distances
print("Distances between cities (km):")
## [1] "Distances between cities (km):"
print(round(dist_matrix, 1))
## Madrid Barcelona Seville Bilbao Valencia Zaragoza Málaga Murcia
## Madrid 0.0 693.3 403.8 397.3 354.0 356.4 543.2 407.6
## Barcelona 693.3 0.0 1026.9 549.4 331.1 345.2 864.7 469.5
## Seville 403.8 1026.9 0.0 809.4 708.3 710.7 174.7 600.5
## Bilbao 397.3 549.4 809.4 0.0 602.3 283.7 957.0 829.8
## Valencia 354.0 331.1 708.3 602.3 0.0 335.2 529.5 215.0
## Zaragoza 356.4 345.2 710.7 283.7 335.2 0.0 681.6 537.8
## Málaga 543.2 864.7 174.7 957.0 529.5 681.6 0.0 417.6
## Murcia 407.6 469.5 600.5 829.8 215.0 537.8 417.6 0.0
## Granada‎ 491.6 757.8 277.4 901.3 422.6 634.2 106.9 310.7
## Vigo 517.8 1086.6 691.1 678.3 896.6 853.5 884.1 818.9
## Granada‎ Vigo
## Madrid 491.6 517.8
## Barcelona 757.8 1086.6
## Seville 277.4 691.1
## Bilbao 901.3 678.3
## Valencia 422.6 896.6
## Zaragoza 634.2 853.5
## Málaga 106.9 884.1
## Murcia 310.7 818.9
## Granada‎ 0.0 820.2
## Vigo 820.2 0.0
After these manipulations, we can plot continental Spain with its roads, highlighting Madrid and Vigo, as shown in the reference plot.
# Plotting
ggplot() +
geom_sf(data = st_transform(spain, 4326), fill = "lightgray") +
geom_sf(data = st_transform(spain_roads, 4326), color = "black", size = 0.3) +
geom_sf(data = st_transform(spain_top_10_cities, 4326), size = 1, color = "black") +
geom_sf(data = st_transform(madrid_vigo %>% filter(NAME == "Madrid"), 4326),
aes(shape = "Madrid", color = "Madrid"), size = 3) +
geom_sf(data = st_transform(madrid_vigo %>% filter(NAME == "Vigo"), 4326),
aes(shape = "Vigo", color = "Vigo"), size = 3) +
scale_color_manual(name = "Cities",
values = c("Madrid" = "red", "Vigo" = "turquoise")) +
scale_shape_manual(name = "Cities",
values = c("Madrid" = 16, "Vigo" = 17))
In comparison the first map looked as follows:
…
and the second graph as follows: